Random Balance: Ensembles of variable priors classifiers for imbalanced data

Authors

  • José-Francisco Díez-Pastor
  • Juan José Rodríguez Diez
  • César Ignacio García-Osorio
  • Ludmila I. Kuncheva
Abstract

In Machine Learning, a data set is imbalanced when the class proportions are highly skewed. Imbalanced data sets arise routinely in many application domains and pose a challenge to traditional classifiers. We propose a new approach to building ensembles of classifiers for two-class imbalanced data sets, called Random Balance. Each member of the Random Balance ensemble is trained with data sampled from the training set and augmented by artificial instances obtained using SMOTE. The novelty in the approach is that the proportions of the classes for each ensemble member are chosen randomly. The intuition behind the method is that the proposed diversity heuristic will ensure that the ensemble contains classifiers that are specialized for different operating points on the ROC space, thereby leading to larger AUC compared to other ensembles of classifiers. Experiments have been carried out to test the Random Balance approach by itself, and also in combination with standard ensemble methods. As a result, we propose a new ensemble creation method called RB-Boost, which combines Random Balance with AdaBoost.M2. This combination involves enforcing random class proportions in addition to instance re-weighting. Experiments with 86 imbalanced data sets from two well-known repositories demonstrate the advantage of the Random Balance approach.

The class-imbalance problem occurs when there are many more instances of some classes than others [1]. Imbalanced data sets are common in fields such as bioinformatics (translation initiation site (TIS) recognition in DNA sequences [2], gene recognition [3]), engineering (non-destructive testing in weld flaw detection through visual inspection [4]), finance (predicting credit card customer churn [5]), fraud detection [6] and many more. Bespoke methods are needed for imbalanced classes for at least three reasons [7]. Firstly, standard classifiers are driven by accuracy, so the minority class may be ignored. Secondly, standard classification methods operate under the assumption that the data sample is a faithful representation of the population of interest, which is not always the case with imbalanced problems. Finally, classification methods for imbalanced problems should allow errors coming from different classes to have different costs.

Galar et al. [8] systematize the wealth of recent techniques and approaches into four categories: (a) Algorithm-level approaches. This category contains variants of existing classifier learning algorithms biased towards learning the minority class more accurately. Examples include decision tree algorithms insensitive to class sizes, such as the Hellinger Distance Decision Tree (HDDT) [9] and the Class Confidence Proportion Decision Tree (CCPDT) [10] …
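The core resampling step described in the abstract (draw a random class proportion, then under-sample whichever class exceeds its quota and SMOTE-augment the other) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the function names are invented, and SMOTE is approximated by a simple nearest-neighbour interpolation within each class.

```python
import numpy as np

def smote_like_oversample(X, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating sampled points
    toward one of their k nearest same-class neighbours (a minimal
    SMOTE-style step; assumed, not the paper's exact procedure)."""
    rng = rng or np.random.default_rng()
    if len(X) == 1:
        return np.repeat(X, n_new, axis=0)
    k = min(k, len(X) - 1)
    # pairwise distances within the class; exclude self-matches
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    idx = rng.integers(0, len(X), n_new)            # base points
    nbr = nn[idx, rng.integers(0, k, n_new)]        # random neighbour each
    gap = rng.random((n_new, 1))                    # interpolation factor
    return X[idx] + gap * (X[nbr] - X[idx])

def random_balance_sample(X, y, rng=None):
    """One Random Balance resample for a two-class problem: choose a
    random size for one class (at least 2 per class), under-sample the
    class that is too large, SMOTE-augment the one that is too small,
    and keep the total training-set size unchanged."""
    rng = rng or np.random.default_rng()
    n = len(y)
    classes = np.unique(y)                          # assumes exactly two classes
    n_pos = rng.integers(2, n - 1)                  # random proportion, in [2, n-2]
    parts_X, parts_y = [], []
    for cls, target in zip(classes, (n - n_pos, n_pos)):
        Xc = X[y == cls]
        if target <= len(Xc):                       # under-sample without replacement
            Xn = Xc[rng.choice(len(Xc), target, replace=False)]
        else:                                       # keep all, add synthetic instances
            extra = smote_like_oversample(Xc, target - len(Xc), rng=rng)
            Xn = np.vstack([Xc, extra])
        parts_X.append(Xn)
        parts_y.append(np.full(len(Xn), cls))
    return np.vstack(parts_X), np.concatenate(parts_y)
```

In an ensemble, each base classifier would be trained on a fresh `random_balance_sample` draw, so different members see very different class proportions and hence specialize in different regions of ROC space.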


Similar Articles

Using Model Trees and Their Ensembles for Imbalanced Data

Model trees are decision trees with linear regression functions at the leaves. Although originally proposed for regression, they have also been applied successfully in classification problems. This paper studies their performance for imbalanced problems. These trees give better results than standard decision trees (J48, based on C4.5) and decision trees specific for imbalanced data (CCPDT: Clas...


Ensembles of (α)-Trees for Imbalanced Classification Problems

This paper introduces two kinds of decision tree ensembles for imbalanced classification problems, extensively utilizing properties of α-divergence. First, a novel splitting criterion based on α-divergence is shown to generalize several well-known splitting criteria such as those used in C4.5 and CART. When the α-divergence splitting criterion is applied to imbalanced data, one can obtain decisi...


Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...


Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...


Applicability of Roughly Balanced Bagging for Complex Imbalanced Data

Roughly Balanced Bagging is based on under-sampling and classifies imbalanced data much better than other ensembles. In this paper, we experimentally study the properties that may explain its good performance. Results of experiments show that it can be constructed with a small number of component classifiers, which are quite accurate yet of low diversity. Moreover, its good performance ...




Journal:
  • Knowl.-Based Syst.

Volume 85, Issue -

Pages -

Publication date: 2015